Skip to content

Fix topology scheduling rule: anti-affinity, feature gates, PDB guidance#18

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
sanchezl:fix-topology-scheduling-rule
Jun 4, 2026
Merged

Fix topology scheduling rule: anti-affinity, feature gates, PDB guidance#18
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
sanchezl:fix-topology-scheduling-rule

Conversation

@sanchezl

@sanchezl sanchezl commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes and improvements to the "Topology-Aware Scheduling Compatibility" review rule based on findings from OCPBUGS-65984 investigation and a cross-operator scheduling survey.

Bug fixes

  • Wrong feature gate name: DualReplicaTopologyDualReplica (the enhancement proposed DualReplicaTopology but the implementation in openshift/api/features/features.go uses DualReplica)
  • Wrong label prefix: role.kubernetes.io/arbiternode-role.kubernetes.io/arbiter (missing node- prefix)
  • Incorrect feature gate claims: HighlyAvailableArbiter was listed as requiring a feature gate, but it is ungated in the API enum. DualReplica was listed as tech preview but is now enabled by default.

Anti-affinity guidance reversal

The previous rule recommended preferredDuringSchedulingIgnoredDuringExecution as the safe alternative to required anti-affinity. This recommendation directly caused OCPBUGS-65984: KSVMO#149 followed it, both migrator pods landed on the same node during a bootstrap timing race, and a 74-second Available=False outage occurred when that node was fenced. Every other HA operator (oauth-openshift, openshift-apiserver, image-registry, monitoring) uses required anti-affinity and survived the same fencing event.

The rule now correctly identifies the dangerous combination as required anti-affinity + maxUnavailable: 0 (which deadlocks), not required anti-affinity alone. The recommendation is to use required anti-affinity with maxUnavailable >= 1.

Version-agnostic feature gate references

Feature gate status (TechPreview vs Default) changes between releases, and this rule applies across all branches. Removed hardcoded "tech preview" labels and specific feature gate names. Instead points to openshift/api as the source of truth for the target release.

New guidance

  • PDB limitation: PDBs do not protect against TaintManagerEviction (node fencing/unreachable taints) — only the eviction API (kubectl drain).
  • library-go helpers: Recommends WithTopologyAwareReplicasHook, WithTopologyAwareSchedulingHook, and WithControlPlaneNodeSelectorHook from library-go#2276 for DeploymentController-based operators.

References

- Fix anti-affinity recommendation: required anti-affinity with
  maxUnavailable >= 1 is the correct pattern (used by oauth, openshift-
  apiserver, image-registry, monitoring). Preferred anti-affinity allows
  pod co-location, which caused OCPBUGS-65984 (74-second Available=False
  on two-node fencing when both pods landed on the same node).

- Flag the dangerous combination (required anti-affinity + maxUnavailable: 0)
  instead of flagging required anti-affinity alone.

- Make feature gate references version-agnostic: remove hardcoded
  "tech preview" status and specific feature gate names that change
  between releases. Point to openshift/api as the source of truth.

- Fix wrong feature gate name: DualReplicaTopology -> DualReplica.

- Fix wrong label: role.kubernetes.io/arbiter -> node-role.kubernetes.io/arbiter.

- Add PDB limitation: PDBs do not protect against TaintManagerEviction
  (node fencing), only the eviction API (kubectl drain).

- Recommend library-go topology-aware scheduling hooks for
  DeploymentController-based operators.
@openshift-ci openshift-ci Bot requested review from joelanford and mkowalski June 4, 2026 18:11
@sanchezl

sanchezl commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

/cc @dgoodwin

@openshift-ci openshift-ci Bot requested a review from dgoodwin June 4, 2026 18:15
@dgoodwin

dgoodwin commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

/lgtm

Thanks Luis!

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 4, 2026
@openshift-ci

openshift-ci Bot commented Jun 4, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, sanchezl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit dc5b4f5 into openshift:main Jun 4, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants